Abstract: The problem of detecting a small number of outliers
in a large dataset is an important task in many fields, from fraud
detection to high-energy physics. Two approaches have emerged
to tackle this problem: unsupervised and supervised. Supervised
approaches require a sufficient amount of labeled data and are
challenged by novel types of outliers and inherent class imbalance,
whereas unsupervised methods do not take advantage of available
labeled training examples and often exhibit poorer predictive
performance. We propose BORE (a Bagged Outlier Representation
Ensemble), which uses unsupervised outlier scoring functions
(OSFs) as features in a supervised learning framework. BORE
is able to adapt to arbitrary OSF feature representations, to the
imbalance in labeled data, and to prediction-time constraints
on computational cost. We demonstrate the good performance
of BORE compared to a variety of competing methods in the
non-budgeted and the budgeted outlier detection problem on 12
real-world datasets.


I. Introduction 

The aim of outlier or anomaly detection is to identify 
points in a dataset which deviate in some way from the usually 
observed patterns. Outliers typically represent a small portion 
of the data and can be very diverse in nature. 

In applications where the semantics of outliers are known
in advance (e.g., detection of fraud, intrusions or mislabeled
data), labeled training examples for supervised outlier detection
may be available [?]. Because outliers are naturally
scarce, such data will be heavily imbalanced, which poses a
problem for most classifiers. Another challenge for supervised
outlier detection is the heterogeneity of the outlier class.
This makes generalising from a small number of labeled
samples difficult. Furthermore, supervised methods are unable
to detect novel types of anomalies for which no labeled training
examples have been collected.

Unsupervised algorithms for outlier detection (see e.g. [?]
for a review) are suitable for purely exploratory tasks where
very few or no labeled examples of outliers are available.
These methods are based mainly on geometric properties of
the data and typically assign a real-valued "outlierness" score
to each data point. As such, unsupervised algorithms are suited
to detecting new types of outliers. However, they often exhibit
poor predictive performance [?].

Recently, semi-supervised methods which modify unsupervised
approaches to take advantage of labeled examples have
shown promise [?].


In this work we take an entirely different approach to
combining the strengths of unsupervised and supervised outlier
detection: we propose a supervised algorithm that first learns a
feature representation which successfully differentiates outliers
from inliers using the unlabeled data. Our method is not only
an entirely novel approach that is able to make use of both
unsupervised and supervised information, but also a simple and
easily generalizable framework that allows incorporation of
different outlier detection methods. It avoids the considerable
effort in tuning parameters in existing work (e.g. kernels for
non-linear feature transformations), and handles class imbalance
in a straightforward fashion. Its final output can be easily
interpreted as outlier probabilities. Finally, our algorithm can
adapt to computational budgets at prediction time, providing
good detection performance within user-defined budget
constraints.

Our contribution is two-fold:

1) We propose a Bagged Outlier Representation Ensemble
(BORE), a unified framework for incorporating
unsupervised and supervised data for outlier
detection. BORE first learns a representation of the
outlierness of each point in an unsupervised fashion,
which is then used as a set of features in a classifier
trained on imbalanced data.

2) We consider the case where computational resources
at prediction time are limited and introduce a feature
selection technique that respects a computational budget
while retaining good predictive performance.

The key idea underlying BORE is to view the output
scores of unsupervised outlier scoring function (OSF) algorithms
as non-linear transformations of the original feature
space. Crucially, this new set of features provides a richer
representation which better distinguishes outliers from normal
points. This representation is then used for supervised learning,
where we adopt an ensemble approach that handles
class imbalance. Thus, an advantage of BORE is that its
performance can be boosted by including a larger or more
diverse set of outlier detectors, particularly those which are
known to be suited to the task at hand. This idea complements
a recent line of research on outlier ensembles that strives
to combine outlier scoring functions [?] in an entirely
unsupervised manner.

Since each feature is the output of an OSF learned on
the whole dataset, adding features is costly. While this can
be easily parallelized at training time, at prediction time this
issue is compounded by the fact that OSFs need to be run
on the entire dataset, including the new unseen test points, in
order to make predictions. Clearly this can be prohibitively
expensive for large numbers of OSFs. This is particularly
important when the prediction must be performed under time
or computational constraints, such as in a real-time or embedded
system. To overcome this, we introduce a budget-aware feature
selection approach which identifies a small subset of the
OSFs that represents the best tradeoff between computational
budget and prediction accuracy. At prediction time
we therefore obtain good performance while computing only
a subset of the features. This is a considerable improvement over
existing approaches to representation learning that require large
amounts of resources at both training and prediction time.

Paper outline. In the following section we briefly review
current approaches to unsupervised and supervised outlier
detection. We then detail BORE, our novel approach for
learning representations for outlier detection. In order to deal
with computational budgets at prediction time we then propose
a budget-aware variable selection procedure. Finally, we
present extensive empirical results on 12 real-world datasets
demonstrating the predictive outlier detection performance of
BORE both with and without budget constraints.

II. Related Work 

Feature representations. In recent years, the field of
representation learning has become increasingly popular.
In particular, a wide class of techniques based on deep
neural networks have been proposed which learn rich feature
representations of input data in a supervised or unsupervised
fashion which can then be used for prediction. Such feature
learning approaches have become the keystone of achieving
state-of-the-art performance in a variety of problem domains.
Elsewhere, less complex feature representations can be learned
using correlations between features, which have been shown
to greatly improve prediction when few labeled examples are
available [?]. However, these methods require vast amounts of
training data to learn good representations, which are typically
not available in outlier detection problems.

The key idea in feature learning is that good representations
of the data can be obtained by means of solving an unsupervised
learning problem. In this work, we leverage this idea
with the addition of domain knowledge encoded in the zoo
of existing specialised OSFs. In this respect, training OSFs
on a particular dataset and using their outputs as input to a
supervised learning problem can be viewed as unsupervised
learning of a suitable representation for many possible types
of outliers.

Outlier Detection and Class-Imbalance Learning. Outlier
detection naturally faces the problem of class imbalance.
Therefore, for the supervised case, well-established approaches
from class-imbalance learning can be adopted, including sampling,
bagging and boosting, one-class classification and
cost-sensitive learning (see e.g. [2], [?]). Our proposed technique
is not dependent on any of these models and can readily
complement each of them.

Outlier Ensembles. An appropriate combination of multiple
unsupervised outlier scoring functions into ensembles
can increase outlier detection performance [?]. However,
building an ensemble is difficult in completely unsupervised
settings and only heuristic approaches have been proposed
so far [30], [28]. Open questions concern, e.g., the tradeoff
between accuracy of single outlier scoring functions and their
diversity, or the normalization and the combination function of
the outlier scores [?]. No supervised approaches have been
studied yet in this context except for an initial idea for a
semi-supervised ensemble presented in [26]. Our proposed approach
could be viewed as an ensemble selection technique guided by
the available training data, which addresses the
above stated problems.

Semi-supervised Outlier Detection. Semi-supervised
techniques make use of both labeled and unlabeled data
for training. Recently, a semi-supervised anomaly detector
(SSAD) was proposed [?]. While BORE incorporates the
unsupervised information in its features, SSAD is based on an
unsupervised technique, support vector data description [34],
which learns a hypersphere enclosing the normal data, and
uses this as a regularizer for supervised learning. It requires
an appropriate kernel to be specified as input. The goal is, similarly
to BORE, to achieve good performance using labels while
retaining the possibility of revealing novel anomalies through
the unsupervised information. We experimentally compare
BORE to this technique.

III. Learning a Representation for Outlier Detection

In this section we will detail our basic framework for outlier
detection. First we describe the feature space construction
which we use to learn a good representation of the outliers in
the data in an unsupervised manner. We then use these features
as input to a supervised learning procedure which adapts to
the heterogeneity and class imbalance inherent in the outlier
detection problem.

A. Outlier Scoring Functions 

An outlier is broadly characterized as a point that deviates
in some way from the rest of the data. The exact form of
deviation depends on the data and the application. Diverse
detection and scoring functions have been proposed, including
approaches based on statistical methods, PCA, and other
subspace analysis methods [?]. A large body of
work is based on the analysis of distances and density around
data points, e.g. via kNN distance or relative local density
[?]. Our method supports any such detector that provides
outlier scores.

For the discussion that follows, we require the following 
definition of an outlier scoring function which will form the 
basis of our feature representation. 

Definition 1 (Outlier Scoring Functions). Given a data matrix
$X \in \mathcal{X} \subseteq \mathbb{R}^{n \times k}$, an outlier scoring function (OSF) is a
mapping $\Phi : \mathcal{X} \to \mathbb{R}^n$. That is, an OSF assigns a real-valued
output to each row of the data matrix corresponding to the
degree of outlierness of each point.

The scale and interpretation of different scoring approaches 
may vary. For example, they may be normalised between 0 and 
1 so as to be interpreted as a probability or may be thresholded 
to assign binary labels to points. This makes standard ensemble 
Fig. 1: t-SNE visualisation of three different bags each subsampled from the (a) Letter, (b) Ionosphere and (c) Pageblocks
data sets. The top line shows the original features and the bottom line shows the corresponding learned outlier
representation. In each bag, the outlier representation exhibits a better separation between outlying points (shown
in red) and non-outlying points (blue).


approaches to combining OSFs difficult and highly dependent 
on scaling and normalization. 

B. Feature Space Construction 

Individually, the OSFs detailed above are limited in their
ability to detect multiple types of outliers. The key insight we
provide is that, when combined, the outputs of multiple outlier
detectors provide a good feature representation which can be
used to detect many outlier types.

Let $\Phi = \{\Phi_1, \ldots, \Phi_m\}$ be a set of outlier scoring functions
as in Definition 1. Each $\Phi_j \in \Phi$ is applied to the data $X$ and
returns a vector $\Phi_j(X) \in \mathbb{R}^n$. Our feature representation is
then

$$\Phi(X) = [\Phi_1(X), \ldots, \Phi_m(X)]. \qquad (1)$$

Instead of the original data set, we now work with the
transformed data set $\Phi(X)$ in the OSF feature space. Each
individual $\Phi_j(X)$ can be viewed as a feature vector of the
data.

To construct the set of functions $\Phi$, we may use any
existing unsupervised OSF, where the goal is to capture diverse
aspects of the outliers in a particular dataset. It is therefore
beneficial to expand the feature space by applying each OSF
under a set of perturbations, for example multiple parameter
settings, different distance metrics and different subspaces of
the original features.

In practice the original features of the data also contain
useful information for identifying outliers. Therefore, we will
use an augmented version of (1) where

$$\Phi(X) = [X, \Phi_1(X), \ldots, \Phi_m(X)] \in \mathbb{R}^{n \times d}, \qquad (2)$$

where the combined feature space dimension is $d = k + m$.
With some abuse of notation we will denote $\Phi^{(i)} \in \mathbb{R}^d$ as the
$i$th row of the matrix $\Phi(X)$ and $\Phi_j \in \mathbb{R}^n$ as the $j$th column.

Using the original data and OSF features, we exploit the
strength of unsupervised outlier detection to detect novel types
of outliers within a supervised framework. To this end we
will refer to the combined feature space in (2) as an outlier
representation (OR).
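The construction above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a single OSF family (distance to the k-th nearest neighbour) applied under several values of k, and the helper names `knn_distance_scores` and `outlier_representation` are hypothetical.

```python
import math


def knn_distance_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def outlier_representation(points, ks):
    """Append one kNN-distance score per value of k to every original row."""
    columns = [knn_distance_scores(points, k) for k in ks]
    return [list(p) + [col[i] for col in columns] for i, p in enumerate(points)]


# Toy data: four points on a unit square plus one distant point.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
phi = outlier_representation(X, ks=[1, 2])
```

Each row of `phi` holds the original coordinates followed by the OSF outputs, so its length is the original dimensionality plus the number of OSFs. The distant point receives a much larger score than the points on the square in both score columns.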


C. Learning Setup 


We are now ready to incorporate supervised information
on labeled outliers into our method. The outlier representation
learned in the previous section is a highly non-linear
transformation of the original space. As such we can use a
linear classifier in this new space to detect outliers. To make
the discussion concrete, we adopt logistic regression, but in
practice any linear classifier could be used instead.

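Logistic regression (see e.g. [14]) models the probability of
a point $i$ being an outlier by means of a binary random variable
$y^{(i)} \in \{0, 1\}$ conditioned on the outlier representation $\Phi^{(i)}$ and
a parameter vector $\beta \in \mathbb{R}^d$ through the logistic function:

$$P(y^{(i)} = 1 \mid \Phi^{(i)}; \beta) = \frac{1}{1 + \exp(-\beta^\top \Phi^{(i)})} = \sigma(\beta^\top \Phi^{(i)}), \qquad (3)$$

predicting 1 if $\sigma(\beta^\top \Phi^{(i)}) > 0.5$ and 0 otherwise. The
maximum likelihood estimator for $\beta$ is the solution to

$$\hat{\beta} = \arg\min_{\beta} \; -\sum_{i=1}^{n} \left[ y^{(i)} \log \sigma(\beta^\top \Phi^{(i)}) + (1 - y^{(i)}) \log\left(1 - \sigma(\beta^\top \Phi^{(i)})\right) \right], \qquad (4)$$

which can be solved efficiently using gradient descent or a
pseudo second-order method such as L-BFGS [14].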
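For reference, the gradient used by such solvers can be written out explicitly. The short derivation below assumes that the objective in (4) is the standard negative log-likelihood of logistic regression; the shorthand $L(\beta)$ and $z_i$ is introduced here only for illustration.

```latex
L(\beta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log \sigma(z_i)
           + (1 - y^{(i)}) \log\left(1 - \sigma(z_i)\right) \right],
\qquad z_i = \beta^\top \Phi^{(i)}

\sigma'(z) = \sigma(z)\,\left(1 - \sigma(z)\right)

\nabla_\beta L(\beta)
  = -\sum_{i=1}^{n} \left( y^{(i)} - \sigma(z_i) \right) \Phi^{(i)}
  = -\Phi^\top \left( y - \sigma(\Phi \beta) \right)
```

Under this assumption, each gradient coordinate is (up to sign) the inner product between one feature column and the residual vector $y - \sigma(\Phi\beta)$.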


In practice, any classifier can be used in place of logistic
regression. However, since we already learn a highly non-linear
feature representation, there is little additional utility
to be gained by using a non-linear classifier compared with
the additional cost in optimization and hyper-parameter tuning.
This is illustrated in Section V-B3 where we present results
comparing BORE with a non-linear method, SSAD.


Another benefit of logistic regression in the context of
outlier ensembles is that the output is easily interpreted, since
it directly models the probability of outlierness for a given
datapoint given the outlier representation.


Re-sampling and Bagging. There are two challenges for the
straightforward application of a standard classifier to outlier
detection: the inherent class imbalance problem and the
heterogeneity of the outlier class. We deal with both of these
issues in a unified manner by adapting a standard re-sampling
method: bootstrap aggregating, or bagging [7]. Bagging constructs
B bags by uniformly subsampling datapoints
with replacement. It then averages the output of the models
trained on each of the bags.

Since uniform sampling would result in bags containing
very few outliers, we instead use biased sampling to construct
bags with balanced classes. For every bag b = 1, ..., B, an
equal number of outliers and inliers are sampled uniformly
from their respective populations. Since the number of outliers
is small compared with the size of the inlier class, the same outliers
will appear in multiple bags while the inlier class will be
substantially different between bags.

Due to the inhomogeneity in the data, a different aggregation
scheme such as Stacking or maximin aggregation
[10] could be considered. However, we found empirically that
neither performed as well as standard bagging.
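The balanced sampling scheme described above can be sketched as follows. This is an illustrative sketch under stated assumptions: within a bag, points are drawn without replacement from each class, the bag size is a fixed fraction of the labeled outliers, and the names `balanced_bags` and `bagged_score` are hypothetical.

```python
import random


def balanced_bags(labels, n_bags, outlier_fraction=0.7, seed=0):
    """Return index lists; each bag holds equally many outliers and inliers."""
    rng = random.Random(seed)
    outliers = [i for i, y in enumerate(labels) if y == 1]
    inliers = [i for i, y in enumerate(labels) if y == 0]
    # Assumes there are at least `size` inliers available.
    size = max(1, int(outlier_fraction * len(outliers)))
    return [rng.sample(outliers, size) + rng.sample(inliers, size)
            for _ in range(n_bags)]


def bagged_score(models, x):
    """Average the outlier probabilities predicted by the per-bag models."""
    return sum(model(x) for model in models) / len(models)


# Toy labels: 10 outliers among 100 points.
labels = [1] * 10 + [0] * 90
bags = balanced_bags(labels, n_bags=5)
```

Because the outlier pool is small, the same outlier indices recur across bags, while the inlier halves of the bags overlap far less.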


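Algorithm 1 Orthogonal Matching Pursuit.

Input: Data: $\{\Phi \in \mathbb{R}^{n \times d}, y \in \mathbb{R}^n\}$, # non-zeros: $\gamma$

Initialize: $A = \{\}$, $R^0 = y$, $\beta^0 = 0$

1: for $t = 1, \ldots, \gamma$ do

2:   $j = \arg\max_{j \notin A} |\Phi_j^\top R^{t-1}|$

3:   $A \leftarrow A \cup \{j\}$

4:   $\beta^t$: solve (4) with $\Phi_A$.

5:   $R^t = y - \sigma(\Phi_A \beta^t_A)$

6: end for
Output: $\beta^\gamma$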


IV. Outlier Detection on a Budget 

BORE learns a representation of arbitrary types of outliers
by combining the output of many different outlier detectors in
a bagging framework. Crucially, when the original dimensionality
and the number of samples are large, learning this representation
may be computationally intensive. This is typically
not a problem at training time, since learning the outlier
representation and training classifiers on individual bags can
easily be done in parallel. However, at test time, when new
instances must be classified, this computational burden can be
problematic.

As a concrete example, the pre-trained outlier detection
system might be deployed on less powerful hardware or might
have to classify a point as an outlier under time constraints.
The more OSFs used in the representation, the more resources
are required at test time to compute the representation of the new
points.

To overcome this problem, we propose a classification
strategy under a computational budget.

A. Cost-aware Feature Selection 

Associated with each feature transformation $\Phi_j$ is a
computational cost $c(j)$. For example, this cost could be
directly related to the time or space complexity required to
compute a particular feature. Some of the features may be
too computationally expensive to generate at test time and
equivalent predictive performance might be obtainable by
instead combining a number of cheaply computed features.
Our goal is good detection of outliers for any budget $C$ such
that the sum of the costs of the utilised features does not exceed $C$.
We therefore aim to select features that are expected to have high
utility for the outlier detection task while incurring low cost. In order
to do so we adopt a strategy that also handles potential issues when
the number of OSF features grows.

A challenge is statistical estimation in high dimensions (i.e.
when the dimensionality of the feature space approaches the
number of samples), where the maximum likelihood estimator
for $\beta$ in (4) is ill-defined. This becomes a problem when the
set of OSFs becomes large. We tackle the budgeted detection
problem and the high-dimensional estimation problem in a
unified manner by considering the set of $\gamma$-sparse models
$\{\beta : \|\beta\|_0 \le \gamma\}$, where $\gamma$ is a positive integer and $\|\beta\|_0$ is the
$\ell_0$ "norm" which counts the number of non-zero elements in
$\beta$.

We enforce this condition by updating our estimate of $\beta$ in
a coordinate-wise manner using Orthogonal Matching Pursuit
(OMP) as described in Algorithm 1. OMP for logistic regression
is an iterative algorithm which starts with a candidate
solution $\beta^0$ consisting of the zero vector [23]. At each step
$t$ it adds a single non-zero coordinate to the solution. This
coordinate $j$ is selected according to the criterion in line 2
and is added to the set of selected coordinates, $A \leftarrow \{j\} \cup A$.
$R^t = y - \sigma(\Phi_A \beta^t_A)$ is the vector of residual errors at iteration
$t$. The model is then updated by solving the logistic regression
problem as in line 4, where the subscript $A$ considers only the
coordinates indexed in $A$ and all others remain zero.


These steps are repeated for $t = 1, \ldots, \gamma$, that is, until there
are at most $\gamma$ non-zero elements in $\beta$.


Recently, a variant of OMP, budgeted OMP, which takes
feature cost into account has been proposed [13]. The only
difference is to line 2 of Algorithm 1. Given an active set $A$ of
selected features, the next feature is instead selected according
to:

$$j = \arg\max_{j \notin A} \frac{|\Phi_j^\top R^{t-1}|}{c(j)}.$$

This introduces a trade-off between the utility of a particular
feature (in terms of reducing the residual training error) and
the cost of that feature. As such, when $|A| = d$, we also obtain
an ordering of the features according to this trade-off. This
allows predictions to be made at test time to fit a particular
computational budget $C$ by selecting the top $q$ features such
that $\sum_{j=1}^{q} c(A_j) \le C$, where $A_j$ is the $j$th element added
to the active set. It is shown in [?] that for a given budget, budgeted
OMP returns a solution which is close to optimal.
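The greedy selection rule can be illustrated with a small, self-contained sketch. It is not the reference implementation of budgeted OMP: the selection score is assumed to be the absolute inner product between a feature column and the current residual divided by the feature cost, the logistic regression on the active set is solved by plain gradient descent, and the function names are hypothetical. Setting all costs to 1 gives the unbudgeted selection rule.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, y, active, steps=500, lr=0.5):
    """Gradient descent on the mean negative log-likelihood, active columns only."""
    beta = {j: 0.0 for j in active}
    n = len(rows)
    for _ in range(steps):
        grad = {j: 0.0 for j in active}
        for row, target in zip(rows, y):
            p = sigmoid(sum(beta[j] * row[j] for j in active))
            for j in active:
                grad[j] += (p - target) * row[j] / n
        for j in active:
            beta[j] -= lr * grad[j]
    return beta


def budgeted_omp(rows, y, costs, n_nonzero):
    """Greedily add the feature maximising |feature . residual| / cost."""
    d = len(rows[0])
    active, beta = [], {}
    residual = list(y)  # R^0 = y, following the initialisation in Algorithm 1
    for _ in range(n_nonzero):
        def gain(j):
            corr = sum(row[j] * r for row, r in zip(rows, residual))
            return abs(corr) / costs[j]
        candidates = [j for j in range(d) if j not in active]
        active.append(max(candidates, key=gain))
        beta = fit_logistic(rows, y, active)
        residual = [t - sigmoid(sum(beta[j] * row[j] for j in active))
                    for row, t in zip(rows, y)]
    return active, beta


# Toy data: features 0 and 2 are identical and informative, feature 1 is noise.
rows = [(1.0, 0.3, 1.0), (0.9, -0.2, 0.9), (-1.0, 0.1, -1.0), (-0.8, -0.3, -0.8)]
y = [1, 1, 0, 0]
active, beta = budgeted_omp(rows, y, costs=[1.0, 1.0, 5.0], n_nonzero=1)
```

In this toy example features 0 and 2 are equally informative, so the cheaper of the two is selected; swapping their costs changes the selection accordingly.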


B. Stability Selection 

As explained in Section III-C, since the outlier detection
problem is highly imbalanced, bagging with non-uniform
sampling is necessary to learn a good classifier. This presents
the problem that, due to random fluctuations introduced in the
sub-sampling step, the order in which features are selected by
budgeted OMP in each bag may be different. The question of
which features to include in the final model is answered by
stability selection [?]. Stability selection is a model selection










TABLE I: Datasets used for evaluation. Outlier ratio in % (r), dataset size (n), dimensionality (d).

Dataset          r      n     d
Cardio           5   1734    21
Heart Disease   10    166    13
Hepatitis        9     74    19
Higgs            5   5000    30
Ionosphere      36    351    32
Letter           6   1600    32
Pageblocks       5   5139    10
Parkinson       20     60    22
Pima             5    526     8
Spambase         2   2579    57
Waveform         3   3443    21
Wilt             5   4839     5


based on comparing multiple models trained on subsamples
of the data. Under certain conditions for a linear model (and
assuming uniform n/2 bootstrap sampling), it has been shown
that the set of features returned by stability selection for OMP
corresponds to the true underlying model [25]. In order to deal
with our budget constraint, we present a slightly modified
version of stability selection.


The OSFs used are kNN [?], kNN-weight [?], ODIN [?], LOF [?],
SimplifiedLOF [32], COF [?], INFLO [?], LoOP [?], LDOF
[?], LDF [?], KDEOS [?] and FastABOD [?].³ Each
OSF depends on a neighbourhood parameter which we set as
$k \in \{1, 10, 20, \ldots, 100\}$ (or less for smaller data sets). Each
value of $k$ results in a distinct OSF. In total we obtain 71-132
transformed features per data set.


For each bag $b = 1, \ldots, B$ we obtain the solution vector
which satisfies a given budget constraint $C$ using budgeted
OMP, which we denote as $\hat{\beta}^b_C$. The set of selected features is
then $S^b_C = \{j : \hat{\beta}^b_{C,j} \neq 0, \; j = 1, \ldots, d\}$. For each feature
$j$ we count the proportion of times it was selected across the
bags as $J_j = \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}\{j \in S^b_C\}$, where $\mathbb{I}$ is the binary
indicator function.

Now, in order to ensure that the constraint is satisfied in the
final model, we construct a stable set of features as the set of
most commonly selected features across all bags which satisfy
the budget constraint. Denoting $\Pi(J)$ as the permutation which
sorts the elements of $J$ in descending order, the stable set is
then

$$S_C = \Big\{ j : \sum_{j \in \Pi(J)} c(j) \le C \Big\}.$$

That is, we add features to $S_C$ in order corresponding to
how often they were selected across the $B$ bags, until their
combined cost matches the specified budget $C$.

Finally, a classifier in each bag is trained using the set of
features indexed by $S_C$. Although the theoretical guarantees
about the final model no longer hold due to our imbalanced
sampling procedure, we find that our modified stability selection
procedure performs well empirically.
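The modified stability selection step can be sketched as follows. This is a minimal illustration under stated assumptions: features are added in order of selection frequency and the procedure stops at the first feature that would exceed the budget; the function name `stable_set` and the toy inputs are hypothetical.

```python
def stable_set(selected_per_bag, costs, budget):
    """Pick the most frequently selected features whose total cost fits the budget."""
    n_bags = len(selected_per_bag)
    counts = [0] * len(costs)
    for selected in selected_per_bag:
        for j in selected:
            counts[j] += 1
    # Proportion of bags in which each feature was selected.
    freq = [count / n_bags for count in counts]
    # Sort feature indices by decreasing selection frequency (ties keep index order).
    order = sorted(range(len(costs)), key=lambda j: -freq[j])
    chosen, spent = [], 0.0
    for j in order:
        if spent + costs[j] > budget:
            break  # assumption: stop at the first feature that does not fit
        chosen.append(j)
        spent += costs[j]
    return chosen, freq


# Toy example: three bags, three features with costs 1, 1 and 3.
per_bag = [{0, 2}, {0, 1}, {0, 2}]
chosen, freq = stable_set(per_bag, costs=[1.0, 1.0, 3.0], budget=4.0)
```

An alternative design would skip a feature that does not fit and continue with cheaper, less frequently selected ones; the sketch above stops instead, which keeps the chosen set a prefix of the frequency ordering.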

V. Experimental Evaluation 
A. Datasets and Features 

We evaluate the performance of BORE on 12 real-world
datasets summarised in Table I. The dataset Higgs consists
of the training set of the Higgs Boson Machine Learning
Challenge¹, where the goal was to distinguish measurements of
Higgs boson particles from the background. We subsampled
Higgs bosons such that they form a minority class. The
rest of the datasets come from UCI [?] and they were also
preprocessed (by subsampling one or multiple classes) such
that they are suitable for the outlier detection task.² We split
each dataset into 60% training and 40% testing data.

Outlier Scoring Functions. We use a range of distance and
density-based OSFs as our feature transformations.

¹ https://www.kaggle.com/c/higgs-boson

² Datasets and detailed information about preprocessing available from
http://www.dbs.ifi.lmu.de/research/outlier-evaluation


For two data sets, Letter and Higgs, we alternatively use
kNN and LOF combined with feature bagging [22]. That is,
we compute the OSFs in different subspaces of the original
domain, resulting in 50 and 40 transformed features for these
datasets, respectively.


We begin with a visual analysis of the learned outlier
representations (ORs). Figure 1 compares the original features (top
row) and the OR (bottom row) for three datasets. For the
visualisation we applied t-distributed Stochastic Neighbourhood
Embedding (t-SNE) [35] to bags of points subsampled according
to the procedure described in Section III-C. t-SNE finds
a low-dimensional non-linear embedding of high-dimensional
data which groups similar points together and enforces greater
distances between dissimilar points. The visualisations reveal
that the learned ORs provide a better separation between the
outlying points and the non-outlying points. In contrast, in the
original feature space the outliers tend to be more uniformly
distributed amongst the non-outlying points. This suggests that
classifiers trained on the bags consisting of the OR should
achieve higher accuracy than those trained on the original
features.


B. Learning Outlier Representations 

1) Algorithms: To demonstrate the effectiveness of BORE,
we perform extensive comparisons with a diverse set of
state-of-the-art approaches, both supervised and unsupervised.
Greedy ensemble (GE) is a purely unsupervised technique for
combining OSFs [30]. This baseline uses all OSFs but no
label information. The Best OSF baseline is the single OSF that
exhibits the best performance in hindsight. It should be noted that
this cannot be realised in practice since it requires knowing
the out-of-sample performance of each method a priori. Mean
of OSFs simply averages the output of all OSFs. SSAD [?]
is a semi-supervised outlier detection technique using exactly
the same label information as BORE.

To evaluate our bagging procedure for dealing with imbalance
and heterogeneity in the outlier class, we also run
SSAD on the data augmented by the new outlier representations
(SSAD+OR). To judge the effectiveness of the outlier
representation, we compare against bagged ensemble (BE),
which builds a bagged model in a manner identical to BORE
except only in the original feature space (i.e. it does not learn
the outlier representations).

³ Any other OSF (fitting Definition 1) could also be used.
2) Experimental Setup: For both BORE and BE we construct
50 balanced bags from the training set by subsampling
70% of the labeled outliers and an equivalent number of
labeled normal points into each bag. We use BORE and
BE without $\ell_2$ regularization. Empirically, we observe that
the bagging procedure performs implicit regularization and
the additional $\ell_2$ penalty is chosen as 0 by cross validation.
Importantly, this means that for a fixed number and size of
bags, BORE requires no additional tuning parameters.

For SSAD and SSAD+OR, we require setting the Gaussian
kernel width, $\sigma$, and a regularization parameter, $\kappa$. We
selected the optimal parameters using 5-fold cross validation
for $\sigma \in [0.005, 5]$ and $\kappa \in [0.5, 20]$. SSAD requires the
setting of further parameters which control the regularization
of outliers and inliers separately as well as unlabeled data. We
leave these at their default values. We scale the data between
[0,1] for BORE and normalize by standard deviation for SSAD
(this provides the best empirical performance). For GE, we
initialize the target vector with the ground truth number of
outlier candidates; in reality this is typically not available.

3) Results: Figure [?] shows receiver operating characteristic
(ROC) curves comparing BORE with each of the competitors
on 6 datasets. Table II contains results on all 12 datasets and
compares each of the methods in terms of standard outlier
detection measures (as used e.g. in [?]), namely: area
under the ROC curve (AUC), area under the false positive rate
interval [0, 0.1] of the ROC curve (AUC 0.1) and precision@$n_o$,
where $n_o$ is the ground-truth number of outliers.
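The three measures can be computed from a ranking of outlier scores as sketched below. This is an illustrative sketch, not the evaluation code used for the reported results: it assumes distinct scores (no tie handling), and it assumes that the partial AUC is normalised by the width of the false positive rate interval so that a perfect ranking scores 1. All function names are hypothetical.

```python
def roc_points(scores, labels):
    """ROC curve as (fpr, tpr) points, sweeping the ranking from highest score down."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve on [0, max_fpr], divided by max_fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:  # clip the last segment by linear interpolation
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr


def precision_at_n(scores, labels):
    """Fraction of true outliers among the n_o top-ranked points."""
    n_out = sum(labels)
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:n_out]
    return sum(labels[i] for i in top) / n_out


# Toy ranking: two outliers (label 1) among six points.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 0]
points = roc_points(scores, labels)
full_auc = auc(points)
auc_01 = auc(points, max_fpr=0.1)
p_at_n = precision_at_n(scores, labels)
```

For this toy ranking the full AUC is 0.875, since seven of the eight outlier/inlier pairs are ordered correctly, while the measures that focus on the top of the ranking are lower because one inlier is ranked above the second outlier.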

BORE is the best performing method in all three measures
for 6 datasets and performs best in at least one of the measures
in 9 of the 12 datasets. For the remaining datasets, BORE is
always among the best performing methods. The ranking of the
competing methods varies from dataset to dataset.

As expected, the methods which use label information
outperform the unsupervised methods. Crucially, BORE almost
always outperforms BE, which implies that the representation
learned by combining OSFs is better for outlier detection
than using the original features, as also suggested by the
visualisations in Figure 1.

However, combining our outlier representation with SSAD
does not always yield improved performance. This is perhaps
due to the additional feature transformation that SSAD performs
using the Gaussian kernel. The use of different kernel
functions may improve the performance of SSAD+OR but was
not explored. Since the OR feature space is already a non-linear
transformation of the data, it is not clear which kernel
function would be appropriate in combination with the OSFs
to improve performance. Alternatively, multiple kernel learning
could be used to find a good combination of kernels but this
would increase the number of tuning parameters. In this regard,
BORE is far less sensitive to its hyperparameters (bag size
and number of bags, shown below) whereas SSAD is highly
sensitive to kernel width and regularization strength.

Greedy ensemble is often worse than the Mean of OSFs,
but it only selects a small number of OSFs (between 1 and
13).

Sensitivity to parameters. In Fig. [?] we compare the
sensitivity of BORE and SSAD to their respective parameter
settings in terms of AUC. Recall that BORE requires setting
the number of bags and their size, whereas SSAD requires
a regularization parameter, $\kappa$, and the kernel width, $\sigma$. The
AUC achieved by BORE is similar across all parameter values,
even for a small number of bags. On the other hand,
the performance of SSAD varies significantly for different
parameters, highlighting the need for cross-validation.

C. Budgeted Outlier Detection 

1) Algorithms and Experimental Setup: We now evaluate 
the proposed feature selection technique for BORE in the 
setting where a prediction-time budget is imposed. We report 
ROC AUC and AUC under the beginning of the ROC curve 
(AUC 0.1). Due to space limitations we omit the results for 





































































TABLE II: Results of outlier detection for all 12 data sets (in %). Algorithms marked with * are unsupervised, others
are supervised. AUC refers to the area under the ROC curve, AUC 0.1 is the area under the beginning of the ROC
curve (fpr interval [0, 0.1]) and precision@n_o is precision at the n_o-th position in the outlier ranking, where n_o is the
ground-truth number of outliers in the data set. Bold denotes the best performance for a given evaluation measure and
dataset.


cardio 

AUC AUC 0.1 precision@no 

heartdisease 

AUC AUC 0.1 precision@no 

AUC 

hepatitis 

AUC 0.1 

precision @no 

Best OSF’' 

79.08 

26.00 

27.78 

88.14 

40.00 

50.00 

93.10 

0.00 

0.00 

Mean of OSFs* 

75.75 

31.19 

33.33 

79.03 

15.00 

37.50 

79.31 

0.00 

0.00 

GE* 

74.75 

13.84 

13.89 

71.82 

7.50 

25.00 

79.31 

0.00 

0.00 

SSAD 

92.50 

48.66 

47.22 

78.81 

27.50 

37.50 

93.10 

0.00 

0.00 

SSAD-pOR 

92.49 

50.97 

52.78 

86.44 

37.50 

50.00 

96.55 

50.00 

0.00 

BE 

95.77 

65.72 

55.56 

84.96 

52.50 

62.50 

93.10 

0.00 

0.00 

BORE 

95.98 

66.83 

63.89 

88.35 

65.00 

75.00 

100.00 

100.00 

100.00 


higgs 

AUC AUC 0.1 precision@no 

iono 

AUC AUC 0.1 precision @no 

letter 

AUC AUC 0.1 precision@no 

Best OSF* 

62.30 

11.19 

12.24 

94.22 

68.00 

82.00 

91.16 

44.15 

38.89 

Mean of OSEs* 

56.95 

2.63 

3.06 

93.38 

73.89 

84.00 

89.41 

35.79 

36.11 

GE* 

56.31 

5.53 

8.16 

12.23 

0.00 

6.00 

84.38 

19.23 

25.00 

SSAD 

70.75 

20.98 

21.43 

96.29 

88.13 

92.00 

97.22 

75.41 

66.67 

SSAD-pOR 

73.55 

26.55 

25.51 

95.41 

86.56 

88.00 

96.22 

75.80 

69.44 

BE 

73.40 

26.55 

25.51 

83.74 

55.11 

70.00 

84.88 

35.09 

33.33 

BORE 

81.47 

37.95 

35.71 

97.47 

91.11 

94.00 

96.42 

76.70 

66.67 


pima 

AUC AUC 0.1 precisionOrio 

pageblocks 

AUC AUC 0.1 precision@no 

parkinson 

AUC AUC 0.1 precision@no 

Best OSF" 

71.27 

11.13 

15.38 

91.13 

40.61 

37.25 

91.58 

40.00 

60.00 

Mean of OSEs* 

64.82 

0.81 

0.00 

90.10 

42.96 

37.25 

83.16 

40.00 

40.00 

GE^ 

59.63 

0.00 

0.00 

29.36 

5.90 

6.86 

85.26 

40.00 

60.00 

SSAD 

62.53 

4.05 

0.00 

96.46 

72.41 

62.75 

74.74 

40.00 

60.00 

SSAD-pOR 

66.20 

6.62 

7.69 

97.48 

81.99 

69.61 

88.42 

20.00 

60.00 

BE 

67.50 

11.94 

15.38 

94.23 

73.22 

62.75 

76.84 

60.00 

60.00 

BORE 

69.99 

17.81 

23.08 

97.93 

83.28 

68.63 

91.58 

80.00 

80.00 


spambase 

AUC AUC 0.1 precisionOrio 

waveform 

ROC AUC AUC 0.1 precision@no 

wilt 

AUC AUC 0.1 precision@no 

Best OSF’^ 

85.68 

42.16 

36.84 

75.16 

33.82 

31.11 

80.52 

20.19 

17.76 

Mean of OSEs* 

85.15 

44.54 

36.84 

73.07 

33.62 

28.89 

74.66 

3.94 

3.74 

GE* 

83.38 

43.94 

31.58 

70.33 

22.78 

17.78 

70.40 

0.00 

0.00 

SSAD 

96.14 

72.95 

42.11 

91.84 

59.82 

53.33 

98.56 

85.78 

74.77 

SSAD-i-OR 

96.78 

65.65 

21.05 

87.73 

37.72 

28.89 

96.17 

77.78 

72.90 

BE 

94.21 

66.59 

52.63 

88.07 

40.85 

33.33 

98.03 

78.26 

66.36 

BORE 

95.50 

67.12 

47.37 

91.98 

54.77 

40.00 

98.38 

87.29 

76.64 



Fig. 3: Sensitivity of ROC AUC values to different choices 
of tuning parameters (Ionosphere). 


precision@n_o, but note that this measure exhibits similar trends 
to AUC 0.1. 

We compare our proposed cost-aware feature selection 
scheme, BORE-Budget, against BORE-OMP, which uses the 
unmodified OMP algorithm for feature selection. Note that none 
of the competitors from the previous experiments is designed 
to take feature costs into account. Therefore, as a baseline 
we show the average detection performance of BORE using 
a random subset of features selected such that the budget 
constraint is satisfied (BORE-random). 

To every OSF, we assign a feature cost. We then evaluate 
the detection performance of the three algorithms for a series of 
budgets. We sample feature costs uniformly with replacement 
from a pool of values and randomly assign them to the features. 
The pool is constructed such that the costs correspond to 
different computational complexities: {n, 2n, 5n, n^2, 2n^2, 3n^2, 
n^3, 2n^3}. Concretely, we instantiate the pool with n = 10 as {10, 20, 50, 
100, 200, 300, 1000, 2000} for all data sets. We report the 
average performance over 20 different random assignments of 
costs to features. The original features always have a uniform 
cost of 1 in our experiments. 
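
The cost assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the experiments: the function names, the fixed seed, and the convention of listing the original features before the OSF features are all hypothetical.

```python
import random

N = 10  # the pool {n, 2n, 5n, n^2, 2n^2, 3n^2, n^3, 2n^3} instantiated with n = 10
COST_POOL = [N, 2 * N, 5 * N, N**2, 2 * N**2, 3 * N**2, N**3, 2 * N**3]


def assign_costs(n_original, n_osf, rng):
    """Return one cost vector: original features cost 1, and each OSF
    feature draws its cost uniformly, with replacement, from the pool."""
    osf_costs = [rng.choice(COST_POOL) for _ in range(n_osf)]
    return [1] * n_original + osf_costs


def repeated_assignments(n_original, n_osf, n_repeats=20, seed=0):
    """Generate the independent random cost assignments over which
    performance is averaged (20 in the text; the seed is an assumption)."""
    rng = random.Random(seed)
    return [assign_costs(n_original, n_osf, rng) for _ in range(n_repeats)]
```

Each returned vector would then be paired with one run of the feature selection procedure, and the reported figure would be the mean over the 20 runs.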

Having assigned costs, we apply OMP and budgeted OMP 
to get a feature ranking. We then evaluate outlier detection 
performance for budgets ranging from 10 (the minimal cost of 
a transformed feature) to the maximum which is the sum of 
all feature costs. For every budget, each method selects a set 
of features such that the budget constraints are satisfied. For 
the baseline BORE-random, we randomly select a subset of 
features which satisfy the budget constraint and report average 

(c) Hepatitis. (d) Higgs. 


Fig. 4: Evaluation of outlier detection on budget (first four data sets, see below for remaining data sets). Budgeted OMP 
is compared to standard OMP and random selection of features. For each data set, we report on ROC AUC and ROC 
AUC on the false positive rate interval [0, 0.1] (y-axis) for different budgets (x-axis, log-scaled). 


performance over 20 random subsamples. 
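
One way to turn a feature ranking into a budget-feasible feature set is sketched below. The text does not specify the exact rule, so the choice made here (walk the ranking and keep every feature that still fits, skipping those that do not) is an assumption, as are the function names. The random baseline applies the same rule to a random permutation of the features.

```python
import random


def select_by_ranking(ranking, costs, budget):
    """Walk the ranking from best to worst and keep each feature whose
    cost still fits into the remaining budget (assumed selection rule)."""
    chosen, spent = [], 0
    for f in ranking:
        if spent + costs[f] <= budget:
            chosen.append(f)
            spent += costs[f]
    return chosen


def select_random(costs, budget, rng):
    """Random baseline: apply the same rule to a shuffled feature order."""
    ranking = list(range(len(costs)))
    rng.shuffle(ranking)
    return select_by_ranking(ranking, costs, budget)


# Toy example with four transformed features (costs taken from the pool).
costs = [10, 2000, 20, 100]
ranking = [1, 0, 3, 2]  # hypothetical ranking, most useful feature first
for budget in (10, 130, sum(costs)):
    print(budget, select_by_ranking(ranking, costs, budget))
```

With a budget equal to the sum of all costs, every feature is selected, which corresponds to the right-hand end of the budget axis in Fig. 4.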

The number of bags and outlier subsampling ratio for 
BORE are the same as for the previous experiments (50 and 
70%, respectively). 

2) Results: Results are shown in Fig. 4. In 9 of the 12 
datasets, for any given budget BORE-Budget typically achieves 
a larger AUC and AUC 0.1 than the competing methods. This 
is expected since BORE-Budget selects features taking their 
cost into account. 

In the setting with a single bag, the AUC is expected to 
increase monotonically with the budget, i.e., as the budget increases, the 
performance of the method should always improve. The lack of 
strict monotonicity exhibited by BORE-Budget and BORE-OMP 
is explained by the effects of bagging and stability 
selection, which are necessary to provide good performance 
in the highly imbalanced setting. However, BORE-Budget 
exhibits smoother behaviour than the competitors as the budget 
changes. 

The poor performance of random selection underlines 
the importance of a principled feature selection procedure. 
Interestingly, the relatively monotonic behaviour of the random 
baseline as more features are added to the model underscores 
that BORE benefits from using a larger number of OSFs 
as feature transformations. This further emphasises that in 
the absence of strict computational constraints a large and 
diverse set of OSFs should be chosen to construct the outlier 
representation. 


VI. Conclusion 

We have introduced BORE, an approach to outlier detection 
which combines unsupervised and supervised techniques in 
order to build a rich representation of the outliers in the 
data. One of the main benefits of BORE is its simplicity, 
which takes advantage of decades of existing research in 
designing outlier scoring functions to result in a powerful 
algorithm which is insensitive to tuning parameters. BORE 
is based on effective supervised learning methods that are 
well studied, and leverages the recent wisdom that learning 
a good representation is of utmost importance to training a 
simple yet highly predictive model. In this manner, we propose 
an entirely new way of integrating unsupervised information 
into supervised outlier detection. We have shown that BORE 
outperforms existing unsupervised and supervised methods on 
a wide range of real-world datasets. 

Another key benefit of the BORE framework is its generality 
and extensibility. For example, newly developed OSFs 
can easily be incorporated as part of the feature representation. 
Specific domain knowledge can also be encoded implicitly by 
the choice of OSFs. Furthermore, existing supervised outlier 
detection techniques could be complemented using the BORE 
framework. This removes some of the guesswork inherent in 
designing non-linear feature transformations (i.e. kernels) for 
specific tasks. We have tested this hypothesis empirically with 
SSAD+OR and we have observed improved performance over 
standard SSAD for half of the datasets. 

Figure 4 (Continued): Evaluation of outlier detection on budget. The spikes in performance are due to using a single 
stable set of features, Sc, across all bags, which could differ greatly from the individual active sets in each bag due to 
randomness introduced by subsampling. The smoother performance of BORE-Budget can be explained by the cost-aware 
feature selection procedure: since features are selected based on both cost and utility, the stable set is more 
similar to the active sets in each bag. Smoother performance might be obtained by downweighting the contribution of 
bags whose active sets differ too greatly from Sc instead of simple averaging. 

In the context of recent research on outlier ensembles, 
BORE can be viewed as the first supervised ensemble 
technique. It learns weights for OSFs and, depending on the 
specific classifier used, its final output can be easily interpreted 
as a probability. In contrast, existing unsupervised outlier 
ensembles struggle with proper normalization of the OSF 
outputs and are thus difficult to interpret. BORE 
avoids this problem by learning appropriate thresholds between 
inliers and outliers from the training data. 

Finally, we concentrate on the problem of reducing the 
computational cost at test time. We envisage a scenario where 
resources at training time are plentiful. For example, the OSFs 
can be trained in parallel (which is only required once for the 
full dataset), as can the supervised models on each subsampled 
bag of data. For context, most successful approaches to 
representation learning (e.g., deep networks) require large 
amounts of resources at training and test time. Among the 
methods we compared, BORE is the only one capable of handling 
budget constraints at test time. We have shown that it 
successfully selects a subset of features that 
provides good overall outlier detection performance. 